Distortion-Resistant Hashing for rapid search of similar DNA subsequence

Author

  • Jarek Duda
Abstract

One of the basic tasks in bioinformatics is localizing a short subsequence S, read while sequencing, in a long reference sequence R, such as the human genome. A natural rapid approach would be to compute a hash value for S and compare it with a prepared database of hash values for every length-|S| subsequence of R. The problem with this approach is that it would only spot a perfect match, while in reality there are many small changes: substitutions, deletions and insertions. This issue could be repaired with a hash function designed to tolerate small distortions according to an alignment metric (like Needleman-Wunsch), so that two similar sequences most likely give the same hash value. This paper discusses the construction of Distortion-Resistant Hashing (DRH) to generate such fingerprints for rapid search of similar subsequences. The proposed approach is based on rate distortion theory: within a nearly uniform subset of length-|S| sequences, the hash value represents the sequence closest to S. This gives some control over the distance between collisions, i.e. sequences having the same hash value.
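
The idea in the abstract, hashing a sequence to its closest member of a fixed subset, can be illustrated with a minimal Python sketch. This is not the construction from the paper: the codebook here is simply passed in (or drawn at random by the hypothetical helper `make_codebook`) as a crude stand-in for a "nearly uniform subset", plain unit-cost edit distance stands in for the alignment metric, and the nearest codeword is found by brute force, which would be far too slow for a real genome. All function names are assumptions made for illustration.

```python
import random

ALPHABET = "ACGT"


def edit_distance(a: str, b: str) -> int:
    """Unit-cost Levenshtein distance (assumed stand-in for an alignment metric)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def make_codebook(length: int, size: int, seed: int = 0) -> list[str]:
    """Hypothetical helper: random codewords as a rough 'nearly uniform' subset."""
    rng = random.Random(seed)
    return ["".join(rng.choice(ALPHABET) for _ in range(length))
            for _ in range(size)]


def drh_hash(seq: str, codebook: list[str]) -> int:
    """Hash value = index of the closest codeword (ties go to the lowest index)."""
    return min(range(len(codebook)),
               key=lambda i: edit_distance(seq, codebook[i]))


def build_index(reference: str, length: int, codebook: list[str]) -> dict[int, list[int]]:
    """Map each hash value to the start positions of length-`length` windows of R."""
    index: dict[int, list[int]] = {}
    for pos in range(len(reference) - length + 1):
        h = drh_hash(reference[pos:pos + length], codebook)
        index.setdefault(h, []).append(pos)
    return index
```

To look up a read S, one would compute `drh_hash(S, codebook)` and inspect only the positions stored under that value in the index. A read that differs from a reference window by a few substitutions, insertions or deletions often, though not always, falls to the same codeword, which is the tolerance a conventional hash lacks.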

Similar resources

Modified DNA Extraction for Rapid PCR Detection of Methicillin-Resistant Staphylococci

Nosocomial infection caused by methicillin-resistant staphylococci poses a serious problem in many countries. The aim of this study was to rapidly and reliably detect methicillin-resistant staphylococci in order to suggest appropriate therapy. The presence or absence of the methicillin-resistance gene in 115 clinical isolates of Staphylococcus aureus and 50 isolates of coagulase-negative Staphy...

RHash: Robust Hashing via ℓ∞-norm Distortion

Hashing is an important tool in large-scale machine learning. Unfortunately, current data-dependent hashing algorithms are not robust to small perturbations of the data points, which degrades the performance of nearest neighbor (NN) search. The culprit is the minimization of the ℓ2-norm, average distortion among pairs of points to find the hash function. Inspired by recent progress in robust op...

A Single Index Approach for Distortion-Free Time-Series Subsequence Matching

In this paper we propose a new method for distortion-free time-series subsequence matching. Our method is distortion-free in the sense that it preprocesses the time series to remove the distortions of offset translation and amplitude scaling at the same time. We call this preprocessing the normalization transform in this paper. Previous work on the normalization-transformed subsequence m...

Approximate Substructure Search in a Database of 3D Graphs

Given a database D of three-dimensional (3D) graphs and a query graph Q, the problem of substructure search is defined as finding the graphs in D that contain Q. This is an important search operation in scientific databases. This paper extends the search operation to find those graphs in D that "approximately" contain Q in the presence of rotation, translation, distortion, and node insert/delete i...

Detection of Isoniazid-Resistant Clinical isolates of Mycobacterium tuberculosis from India using Ser315Thr marker by Comparison of molecular methods

In this study, substitution at codon Ser315 of the katG gene, a reliable marker for isoniazid (INH) resistance, was analyzed and compared by three molecular methods, namely DNA sequencing, polymerase chain reaction restriction fragment length polymorphism (PCR-RFLP) and PCR-single strand conformation polymorphism (PCR-SSCP), in 105 phenotypically resistant isolates obtained from various parts of Ind...

Journal title:
  • CoRR

Volume abs/1602.05889  Issue

Pages -

Publication date 2016